{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# COMPSCI 389: Introduction to Machine Learning\n", "# Topic 3.0 Nearest Neighbor for Regression\n", "\n", "In this notebook we will create our first ML algorithms for regression.\n", "\n", "As an example, we will apply the Nearest Neighbor algorithm to the GPA data set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, here are the import statements that we use in this notebook:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd # For representing data sets\n", "from sklearn.base import BaseEstimator # For creating our nearest neighbor model\n", "import numpy as np # For representing arrays\n", "import timeit # For timing different function calls\n", "from sklearn.neighbors import KDTree # For efficient nearest-neighbor searches (more on this below!)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's load the GPA data set and display it as a reminder of what it contains." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
physicsbiologyhistoryEnglishgeographyliteraturePortuguesemathchemistrygpa
0622.60491.56439.93707.64663.65557.09711.37731.31509.801.33333
1538.00490.58406.59529.05532.28447.23527.58379.14488.642.98333
2455.18440.00570.86417.54453.53425.87475.63476.11407.151.97333
3756.91679.62531.28583.63534.42521.40592.41783.76588.262.53333
4584.54649.84637.43609.06670.46515.38572.52581.25529.041.58667
.................................
43298519.55622.20660.90543.48643.05579.90584.80581.25573.922.76333
43299816.39851.95732.39621.63810.68666.79705.22781.01831.763.81667
43300798.75817.58731.98648.42751.30648.67662.05773.15835.253.75000
43301527.66443.82545.88624.18420.25676.80583.41395.46509.802.50000
43302512.56415.41517.36532.37592.30382.20538.35448.02496.393.16667
\n", "

43303 rows × 10 columns

\n", "
" ], "text/plain": [ " physics biology history English geography literature Portuguese \\\n", "0 622.60 491.56 439.93 707.64 663.65 557.09 711.37 \n", "1 538.00 490.58 406.59 529.05 532.28 447.23 527.58 \n", "2 455.18 440.00 570.86 417.54 453.53 425.87 475.63 \n", "3 756.91 679.62 531.28 583.63 534.42 521.40 592.41 \n", "4 584.54 649.84 637.43 609.06 670.46 515.38 572.52 \n", "... ... ... ... ... ... ... ... \n", "43298 519.55 622.20 660.90 543.48 643.05 579.90 584.80 \n", "43299 816.39 851.95 732.39 621.63 810.68 666.79 705.22 \n", "43300 798.75 817.58 731.98 648.42 751.30 648.67 662.05 \n", "43301 527.66 443.82 545.88 624.18 420.25 676.80 583.41 \n", "43302 512.56 415.41 517.36 532.37 592.30 382.20 538.35 \n", "\n", " math chemistry gpa \n", "0 731.31 509.80 1.33333 \n", "1 379.14 488.64 2.98333 \n", "2 476.11 407.15 1.97333 \n", "3 783.76 588.26 2.53333 \n", "4 581.25 529.04 1.58667 \n", "... ... ... ... \n", "43298 581.25 573.92 2.76333 \n", "43299 781.01 831.76 3.81667 \n", "43300 773.15 835.25 3.75000 \n", "43301 395.46 509.80 2.50000 \n", "43302 448.02 496.39 3.16667 \n", "\n", "[43303 rows x 10 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df = pd.read_csv(\"https://people.cs.umass.edu/~pthomas/courses/COMPSCI_389_Spring2024/GPA.csv\", delimiter=',') # Read GPA.csv, assuming numbers are separated by commas\n", "# df = pd.read_csv(\"data/GPA.csv\", delimiter=',')\n", "display(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before continuing, let's split the data set into `X` (all but the last column = 9 exam scores) and `y` (the last column = GPA) components:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
physicsbiologyhistoryEnglishgeographyliteraturePortuguesemathchemistry
0622.60491.56439.93707.64663.65557.09711.37731.31509.80
1538.00490.58406.59529.05532.28447.23527.58379.14488.64
2455.18440.00570.86417.54453.53425.87475.63476.11407.15
3756.91679.62531.28583.63534.42521.40592.41783.76588.26
4584.54649.84637.43609.06670.46515.38572.52581.25529.04
..............................
43298519.55622.20660.90543.48643.05579.90584.80581.25573.92
43299816.39851.95732.39621.63810.68666.79705.22781.01831.76
43300798.75817.58731.98648.42751.30648.67662.05773.15835.25
43301527.66443.82545.88624.18420.25676.80583.41395.46509.80
43302512.56415.41517.36532.37592.30382.20538.35448.02496.39
\n", "

43303 rows × 9 columns

\n", "
" ], "text/plain": [ " physics biology history English geography literature Portuguese \\\n", "0 622.60 491.56 439.93 707.64 663.65 557.09 711.37 \n", "1 538.00 490.58 406.59 529.05 532.28 447.23 527.58 \n", "2 455.18 440.00 570.86 417.54 453.53 425.87 475.63 \n", "3 756.91 679.62 531.28 583.63 534.42 521.40 592.41 \n", "4 584.54 649.84 637.43 609.06 670.46 515.38 572.52 \n", "... ... ... ... ... ... ... ... \n", "43298 519.55 622.20 660.90 543.48 643.05 579.90 584.80 \n", "43299 816.39 851.95 732.39 621.63 810.68 666.79 705.22 \n", "43300 798.75 817.58 731.98 648.42 751.30 648.67 662.05 \n", "43301 527.66 443.82 545.88 624.18 420.25 676.80 583.41 \n", "43302 512.56 415.41 517.36 532.37 592.30 382.20 538.35 \n", "\n", " math chemistry \n", "0 731.31 509.80 \n", "1 379.14 488.64 \n", "2 476.11 407.15 \n", "3 783.76 588.26 \n", "4 581.25 529.04 \n", "... ... ... \n", "43298 581.25 573.92 \n", "43299 781.01 831.76 \n", "43300 773.15 835.25 \n", "43301 395.46 509.80 \n", "43302 448.02 496.39 \n", "\n", "[43303 rows x 9 columns]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "0 1.33333\n", "1 2.98333\n", "2 1.97333\n", "3 2.53333\n", "4 1.58667\n", " ... \n", "43298 2.76333\n", "43299 3.81667\n", "43300 3.75000\n", "43301 2.50000\n", "43302 3.16667\n", "Name: gpa, Length: 43303, dtype: float64" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Split the inputs from the outputs\n", "X = df.iloc[:,:-1]\n", "y = df.iloc[:,-1]\n", "display(X)\n", "display(y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall that our goal is to use the available data to make predictions for new data points called *queries*. These queries come with input features (e.g., exam scores), but not a label (e.g., GPA). Our goal is to create an algorithm that can predict the corresponding label given the query features.\n", "\n", "In this notebook we will use the GPA data set to solve the regression problem of predicting student GPAs from exam scores." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scikit-Learn Models\n", "\n", "Before creating and coding up our first ML algorithms, let's review a general framework for implementing ML algorithms. Specifically, we will consider a form for defining ML algorithms that is compatible with scikit-learn.\n", "\n", "This framework relies on the term **model**. In ML, a model is a mechanism with the following abilities.\n", "1. It can be initialized using data. This step is sometimes called \"**training** the model using the data\" or \"**fitting** the model to the data\". During this training phase the algorithm processes the provided data to pre-compute values that will be useful for making future predictions.\n", "2. It can be given a query (one or more sets of input features), and it will produce predictions of the labels for the provided inputs. When the model makes a prediction, we say that the model is **run** or **executed**.\n", "\n", "We will follow the scikit-learn template for representing models and ML algorithms. Creating a new ML algorithm means implementing the following functions:\n", "1. `__init__`: A constructor that can be used to set hyperparameters that change the behavior of the algorithm.\n", "2. `fit(self, X, y)`: The function for fitting the model to the data (training the model given the data). 
This function is called to allow the ML algorithm to pre-process the data so that it can more quickly respond to future queries.\n",
"   - `X`: A 2D array-like structure (e.g., DataFrame) representing the features. Each row is a point and each column is a feature.\n",
"   - `y`: A 1D array-like structure (e.g., Series) representing the target values.\n",
"   - **Returns**: This function returns `self` to simplify chaining together operations.\n",
"3. `predict(self, X)`: The function for producing predictions given queries.\n",
"   - `X`: A 2D array-like structure representing the data for which predictions are to be made. Each row in `X` is a sample, and each column is a feature.\n",
"   - **Returns**: A numpy array of predicted labels/values.\n",
"\n",
"For example, here is a template of the code to create a new ML algorithm that is compatible with scikit-learn:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "class CustomMLAlgorithm(BaseEstimator):\n",
"    def __init__(self, param1=1, param2=2):\n",
"        # Initialization code\n",
"        self.param1 = param1 # If you have hyperparameters of your algorithm, they are set here, like param1 and param2\n",
"        self.param2 = param2\n",
"\n",
"    def fit(self, X, y):\n",
"        # Training code\n",
"        # Implement your training algorithm here\n",
"        return self\n",
"\n",
"    def predict(self, X):\n",
"        # Prediction code\n",
"        # Implement your prediction algorithm here\n",
"        return np.zeros(len(X)) # For now we just return all zeroes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then create a model with:\n",
"\n",
"`model = CustomMLAlgorithm()`\n",
"\n",
"If `X` is a DataFrame containing the input features and `y` is a Series containing the resulting labels, we can train the model with:\n",
"\n",
"`model.fit(X,y)`\n",
"\n",
"If `query` is a DataFrame containing the inputs for which we would like to predict the labels, we can get the predictions with:\n",
"\n",
"`predictions = model.predict(query)`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Nearest Neighbor\n",
"\n",
"*Nearest Neighbor* (NN) is a particularly simple yet effective ML algorithm based on the core idea:\n",
"\n",
"> When presented with a query, find the data point (row) that is most similar to the query and give the label associated with this most-similar point as the prediction.\n",
"\n",
"We map this to the scikit-learn functions by having `fit` store the data and `predict` handle all of the processing. `predict` does the following:\n",
"\n",
"1. Loop over each row in the training data, computing the Euclidean distance between the query and the row.\n",
"2. Find the rows with the smallest distance to the query feature vector (if there are ties, there can be more than one).\n",
"3. Create an array holding the labels from these rows.\n",
"4. Return an arbitrary element of the array.\n",
"\n",
"Here is a naive implementation of NN:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "class NaiveNearestNeighbor(BaseEstimator):\n",
"    def fit(self, X, y):\n",
"        # Convert X and y to NumPy arrays if they are DataFrames.
\n", " # This makes fit compatible with numpy arrays or DataFrames\n", " if isinstance(X, pd.DataFrame):\n", " X = X.values\n", " if isinstance(y, pd.Series):\n", " y = y.values\n", "\n", " # Store the training data and labels.\n", " self.X_data = X\n", " self.y_data = y\n", " return self\n", "\n", " def predict(self, X):\n", " # Convert X to a NumPy array if it's a DataFrame\n", " if isinstance(X, pd.DataFrame):\n", " X = X.values\n", "\n", " # We will iteratively load predictions, so it starts empty\n", " predictions = []\n", " \n", " # Loop over rows in the query\n", " for x in X:\n", " # Compute distances from x to all points in X_data. We combine the following steps into one line:\n", " # differences = self.X_data - x # Subtract x from each row of X\n", " # squared_differences = differences ** 2 # Square the differences\n", " # sum_squared_differences = np.sum(squared_differences, axis=1) # Sum each row, giving one number per column\n", " # distances = np.sqrt(sum_squared_differences) # Take the square root of the sum_squared_differences, to get the Euclidean distance.\n", " distances = np.sqrt(np.sum((self.X_data - x) ** 2, axis=1))\n", " \n", " # Find the nearest neighbors (handling ties)\n", " min_distance = np.min(distances)\n", " nearest_neighbors = np.where(distances == min_distance)[0] # np.where returns a tuple of arrays, one for each dimension. E.g., if we gave a 2d array, [0] would be the row index and [1] would be the col index. Distances is 1-D, so there is only one element in the tuple here, hence we pass [0]\n", " nearest_label = self.y_data[nearest_neighbors[0]] # You could return a random element of this array. For now we return the first element.\n", "\n", " # Append this label to predictions\n", " predictions.append(nearest_label)\n", "\n", " # Return the array of predictions we have created\n", " return np.array(predictions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's apply this to the GPA data set, which we already loaded into a DataFrame `df`:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1.33333 2.98333]\n" ] } ], "source": [ "# Create the Nearest Neighbor Model\n", "model = NaiveNearestNeighbor()\n", "\n", "# Call fit to train the model (in this case, just store the data set)\n", "model.fit(X,y)\n", "\n", "# Create two query points (in reality these would be new applicants)\n", "query = df.head(2).iloc[:,:-1]\n", "\n", "# Get predictions for the query points\n", "predictions = model.predict(query)\n", "print(predictions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It works! Well, it produces predictions. Soon we will investigate how *good* these predictions are. 
, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's make our algorithm more efficient.\n",
"\n",
"### Optimizing Nearest Neighbor Search\n",
"\n",
"The current `predict` method in our Nearest Neighbor algorithm loops over the entire data set for each query point, which is inefficient for large data sets.\n",
"\n",
"Instead, we will use data structures designed to make finding nearest neighbors efficient: K-D Trees (or KD Trees) and Ball Trees.\n",
"\n",
"- **Purpose**: To store points and enable efficient search for the closest points to a query.\n",
"- **K-D Trees**: Effective for low-dimensional data, but performance decreases with higher dimensions.\n",
"- **Ball Trees**: Better suited for higher-dimensional spaces.\n",
"\n",
"### Implementation with Scikit-Learn\n",
"\n",
"We will update our algorithm to use a K-D Tree, leveraging the `sklearn.neighbors` module from scikit-learn. This module provides optimized implementations of these data structures, balancing preprocessing time, search speed, and accuracy (exact vs. approximate nearest neighbors)." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "class NearestNeighbor(BaseEstimator):\n",
"    def fit(self, X, y):\n",
"        # Convert X and y to NumPy arrays if they are DataFrames. This makes fit compatible with numpy arrays or DataFrames\n",
"        if isinstance(X, pd.DataFrame):\n",
"            X = X.values\n",
"        if isinstance(y, pd.Series):\n",
"            y = y.values\n",
"\n",
"        # Store the training data and labels.\n",
"        self.X_data = X\n",
"        self.y_data = y\n",
"\n",
"        # Create a KDTree for efficient nearest neighbor search\n",
"        self.tree = KDTree(X)\n",
"\n",
"        return self\n",
"\n",
"    def predict(self, X):\n",
"        # Convert X to a NumPy array if it's a DataFrame\n",
"        if isinstance(X, pd.DataFrame):\n",
"            X = X.values\n",
"\n",
"        # Query the tree for the nearest neighbors of all points in X\n",
"        dist, ind = self.tree.query(X, k=1) # ind will be a 2D array where ind[i,j] is the index of the j'th nearest point to the i'th row in X.\n",
"\n",
"        # Extract the nearest labels\n",
"        return self.y_data[ind[:,0]] # ind[:,0] are the indices of the nearest neighbors to each query (each row in X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see if this gives the same answer as our naive implementation:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1.33333 2.98333]\n" ] } ], "source": [ "model = NearestNeighbor()\n",
"model.fit(X,y)\n",
"predictions = model.predict(query)\n",
"print(predictions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great, they're the same! Note that the two implementations won't necessarily produce the same output when more than one point is nearest. Next, let's compare their runtimes; keep in mind that this comparison may vary with the size, dimension, and sparsity of the data set." ] }
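, { "cell_type": "markdown", "metadata": {}, "source": [ "Before timing the two implementations, a brief aside: `sklearn.neighbors` also provides a `BallTree` class with the same `query` interface as `KDTree`, so adapting our model to higher-dimensional data is a one-line change. Here is a sketch (the class name `BallTreeNearestNeighbor` is our own; everything except the tree construction mirrors `NearestNeighbor` above):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import BallTree # Same query interface as KDTree; better suited to higher dimensions\n",
"\n",
"class BallTreeNearestNeighbor(BaseEstimator):\n",
"    def fit(self, X, y):\n",
"        # Convert to NumPy arrays, exactly as in NearestNeighbor\n",
"        if isinstance(X, pd.DataFrame):\n",
"            X = X.values\n",
"        if isinstance(y, pd.Series):\n",
"            y = y.values\n",
"        self.y_data = y\n",
"\n",
"        # The only substantive change: build a Ball Tree instead of a K-D Tree\n",
"        self.tree = BallTree(X)\n",
"        return self\n",
"\n",
"    def predict(self, X):\n",
"        if isinstance(X, pd.DataFrame):\n",
"            X = X.values\n",
"        # BallTree.query returns (distances, indices), just like KDTree.query\n",
"        dist, ind = self.tree.query(X, k=1)\n",
"        return self.y_data[ind[:,0]]" ] }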
, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average runtime for NaiveNearestNeighbor: 0.003582715000011376 seconds\n",
"Average runtime for NearestNeighbor: 0.08647038499999325 seconds\n" ] } ], "source": [ "def run_model(model_class):\n",
"    model = model_class()\n",
"    model.fit(X, y)\n",
"    predictions = model.predict(query)\n",
"    return predictions\n",
"\n",
"# Number of trials\n",
"numTrials = 100\n",
"\n",
"# Time the NaiveNearestNeighbor\n",
"time_naive = timeit.timeit(lambda: run_model(NaiveNearestNeighbor), number=numTrials)\n",
"print(f\"Average runtime for NaiveNearestNeighbor: {time_naive / numTrials} seconds\")\n",
"\n",
"# Time the NearestNeighbor\n",
"time_efficient = timeit.timeit(lambda: run_model(NearestNeighbor), number=numTrials)\n",
"print(f\"Average runtime for NearestNeighbor: {time_efficient / numTrials} seconds\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question**: Why do you think our naive algorithm was faster?\n",
"\n",
"It could be that the overhead cost of building the K-D Tree is not worth it when running only two queries. Let's run more queries:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average runtime for NaiveNearestNeighbor: 9.231791199999861 seconds\n",
"Average runtime for NearestNeighbor: 0.10479070000292268 seconds\n" ] } ], "source": [ "# Let's run nearly 5,000 of the points (rows 1 through 4,999) through as queries\n",
"query = df.iloc[1:5000,:-1]\n",
"\n",
"# 100 trials is now quite slow, so let's run 1. You could increase this to get a more accurate estimate of runtime.\n",
"numTrials = 1\n",
"\n",
"# Time the NaiveNearestNeighbor\n",
"time_naive = timeit.timeit(lambda: run_model(NaiveNearestNeighbor), number=numTrials)\n",
"print(f\"Average runtime for NaiveNearestNeighbor: {time_naive / numTrials} seconds\")\n",
"\n",
"# Time the NearestNeighbor\n",
"time_efficient = timeit.timeit(lambda: run_model(NearestNeighbor), number=numTrials)\n",
"print(f\"Average runtime for NearestNeighbor: {time_efficient / numTrials} seconds\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It should now be clear that data structures like K-D Trees can speed up nearest neighbor searches when we run many queries, but there is an overhead cost associated with constructing the K-D Tree that makes it less efficient when running a single query (or a small number of queries)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" } }, "nbformat": 4, "nbformat_minor": 2 }